Introduction:

In this data analysis report, we explore the relationship between GDP per capita and life expectancy in different countries. We begin by conducting exploratory data analysis to understand the correlation between these two variables. We then move on to principal component analysis to identify the leading components that explain the variance in GDP and life expectancy data.

We utilize linear discriminant analysis and clustering techniques to investigate the data from a classification standpoint. By applying these methods, we aim to uncover the underlying clustering patterns present within the continents.

Additionally, we perform hypothesis testing to determine if GDP and life expectancy are equal for Asia and Europe. Lastly, we use linear discriminant analysis and linear regression models to predict life expectancy based on GDP values.

gap.raw <- read.csv('gap.csv')
gap <- gap.raw

gap[,3:14] <- log(gap.raw[,3:14])

gdp <- exp(gap[,3:14])
years <- seq(1952, 2007,5)
colnames(gdp) <- years
rownames(gdp) <- gap[,2]

lifeExp <- gap[,15:26]
colnames(lifeExp) <- years
rownames(lifeExp) <- gap[,2]

Exploratory data analysis

I start my analysis with the relation between both variables life expectancy and GDP. It is expected that as a country has a higher GDP it will have more resources to invest in health and a healthy lifestyle. People living in poor countries will not have access to healthy expensive food and expensive health treatments, all those things can increase or decrease the life expectancy of people. As we see on the next plot Kuwait and Saudi Arabia have a behavior different from other countries, that is why I remove them to see a better plot. However, as removing outliers requires further analysis out of the scope of this work, I will keep them for the rest of this course work.

library(tidyr)
library(ggplot2)
library(ggrepel)
x=pivot_longer(as.data.frame(t(lifeExp)), Algeria:"New Zealand")
y=pivot_longer(as.data.frame(t(gdp)), Algeria:"New Zealand")
data = data.frame(x[,c("name","value")],y[,"value"])
colnames(data)=c("country","LifeExp","GDP")
ggplot(data, aes(x=GDP, y=LifeExp)) + geom_point() + geom_smooth() +
geom_text_repel(aes(label=country), force = 3, max.overlaps = 50)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: ggrepel: 1678 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

#library(car)
x=pivot_longer(as.data.frame(t(lifeExp)), Algeria:"New Zealand")
y=pivot_longer(as.data.frame(t(gdp)), Algeria:"New Zealand")
data = data.frame(x[,c("name","value")],y[,"value"])
colnames(data)=c("country","LifeExp","GDP")
data = data[!(data$country=="Kuwait"|data$country=="Saudi Arabia"),]
ggplot(data, aes(x=GDP, y=LifeExp)) + geom_point() + geom_smooth() +
geom_text_repel(aes(label=country), force = 3, max.overlaps = 50)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: ggrepel: 1632 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# par(mfrow=c(4,3))
# for(i in rownames(gdp)){
#   plot(x[,i], y[,i], ylab="Life Exp.", xlab="GDP")
# }

It is very interesting that if we apply a variance-stabilizing transformation to GDP, in this case logarithmic , we obtain a much linear curve. Also as we see on the next plot the variance is not constant (heteroskedasticity).

x=pivot_longer(as.data.frame(t(lifeExp)), Algeria:"New Zealand")
y=pivot_longer(as.data.frame(t(gdp)), Algeria:"New Zealand")
data = data.frame(x[,c("name","value")],y[,"value"])
colnames(data)=c("country","LifeExp","GDP")
data = data[!(data$country=="Kuwait"|data$country=="Saudi Arabia"),]
ggplot(data, aes(x=log(GDP), y=LifeExp)) + geom_point() + geom_smooth(method = lm) +
geom_text_repel(aes(label=country), force = 3, max.overlaps = 50)
## `geom_smooth()` using formula 'y ~ x'
## Warning: ggrepel: 1578 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

From this analysis I conclude that GDP could be a good predictor of life expectancy as they are highly correlated.

Principal component analysis

To carry out the PCA with R we can use either prcomp and princomp, however, prcomp is more recommended because it works doing singular value decomposition and this method has higher numerical accuracy.

gdp.pca = prcomp(log(gdp), scale = TRUE)
lifeExpectancy.pca = prcomp(lifeExp, scale = TRUE)

Here I calculate the proportion of variation explained by each of the principal components, and provide a scree plot.

summary(gdp.pca)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.3592 0.73033 0.32047 0.18149 0.13622 0.09627 0.07515
## Proportion of Variance 0.9404 0.04445 0.00856 0.00274 0.00155 0.00077 0.00047
## Cumulative Proportion  0.9404 0.98481 0.99337 0.99612 0.99766 0.99844 0.99891
##                            PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.06154 0.05953 0.04728 0.04439 0.03990
## Proportion of Variance 0.00032 0.00030 0.00019 0.00016 0.00013
## Cumulative Proportion  0.99922 0.99952 0.99970 0.99987 1.00000
screeplot(gdp.pca, type="lines",col="blue", main="Variance explained by PC (GDP)")
title(xlab="Principal Components")

summary(lifeExpectancy.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4    PC5    PC6     PC7
## Standard deviation     3.3284 0.8219 0.39169 0.21950 0.1250 0.1098 0.08778
## Proportion of Variance 0.9232 0.0563 0.01279 0.00402 0.0013 0.0010 0.00064
## Cumulative Proportion  0.9232 0.9795 0.99229 0.99630 0.9976 0.9986 0.99925
##                            PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.06709 0.04413 0.03673 0.02815 0.02024
## Proportion of Variance 0.00038 0.00016 0.00011 0.00007 0.00003
## Cumulative Proportion  0.99963 0.99979 0.99990 0.99997 1.00000
screeplot(lifeExpectancy.pca, type="lines",col="blue", main="Variance explained by PC (Life Expectancy)")
title(xlab="Principal Components")

GDP proportion of variation

As we see in the R output PC1 explains 94 percent of the variance, PC2 explains 4.4% of the variance, PC3 0.85% of the variance, and so on. If I want to retain more than 90% of the data it would be enough to keep just PC1, but with adding PC2 I retain 98% of the variability of the data. Heuristically analyzing the screeplot using the elbow rule I would say that until 3 PCs it is still worth it to retain because when I pass from 2 to three PCs the variance still changes, but passing from three to four and so on the changes are not significant to my eyes.

Life Expectancy proportion of variation

The screeplot in this case is pretty similar to the one obtained for GDP. 92.32 %, 5.63%, and 1.279 % of the variance are explained by PC1, PC2, and PC3 respectively. For this data, I also think that it is worth it to retain three PCs at most.

Interpretation for each of leading principal components of log(GDP)

Something very interesting is that I can calculate the correlation between PCs and variables and then understand easier the meaning of each PCs. This is what I do in the next code chunk and I also show the Rotation matrix. Conclusions are identical using the rotation matrix though.

GDP PC1 represents basically GDP, PC1 has a high negative correlation with all the GDP between 1952 and 2007. As PC1 increases the GDP becomes lower. PC2 is representing GDP between 1952 and 1977 positively and between 1982 to 2007 negatively. On the other hand countries with a high PC3 will have high values of GDP between 1952 and 1962 and between 1997 and 2007.

Life Expectancy PC1 has a high positive correlation to life expectancy, this means that as a country has a high PC1, its life expectancy will be high as well. PC2 represents positively the values of life expectancy between 1952 and 1982, and conversely negatively from 1987 and 2007. PC3 represents low values of life expectancy from 1952 to 1967 and between 2002 and 2007, but the opposite for the years between 1972 and 1997.

print("GDP rotation:")
 [1] "GDP rotation:"
gdp.pca$rotation[,1:3]
##             PC1         PC2         PC3
## 1952 -0.2807487  0.40051233  0.43574977
## 1957 -0.2848414  0.37055497  0.28293152
## 1962 -0.2888408  0.31874715  0.10896406
## 1967 -0.2910711  0.24625845 -0.11432067
## 1972 -0.2930171  0.16692054 -0.27940184
## 1977 -0.2944425  0.05722029 -0.36986452
## 1982 -0.2941942 -0.04581450 -0.40967555
## 1987 -0.2932798 -0.14621010 -0.30610546
## 1992 -0.2910449 -0.24525213 -0.06508071
## 1997 -0.2871994 -0.33718777  0.14420739
## 2002 -0.2840643 -0.38710054  0.25995158
## 2007 -0.2808777 -0.40215905  0.36894940
cor(log(gdp),gdp.pca$x)
##             PC1         PC2         PC3         PC4           PC5          PC6
## 1952 -0.9430974  0.29250776  0.13964394  0.05725404  0.0101602689 -0.026886032
## 1957 -0.9568454  0.27062889  0.09067055  0.03226023  0.0151704111  0.002979152
## 1962 -0.9702806  0.23279188  0.03491952 -0.02343792  0.0089531003  0.024451045
## 1967 -0.9777725  0.17985091 -0.03663614 -0.07999819 -0.0272921424  0.041908503
## 1972 -0.9843095  0.12190774 -0.08953940 -0.05009606 -0.0570389038 -0.008198222
## 1977 -0.9890979  0.04178992 -0.11852981 -0.02433137  0.0071909926 -0.066601827
## 1982 -0.9882638 -0.03345988 -0.13128798  0.02390850  0.0539393316 -0.004764855
## 1987 -0.9851922 -0.10678221 -0.09809706  0.05496732  0.0534204639  0.037494612
## 1992 -0.9776844 -0.17911597 -0.02085630  0.08317054 -0.0524172159  0.013390365
## 1997 -0.9647666 -0.24625969  0.04621388  0.03364496 -0.0580195297 -0.005921415
## 2002 -0.9542352 -0.28271267  0.08330621 -0.03034894 -0.0001959895 -0.003526780
## 2007 -0.9435305 -0.29371042  0.11823655 -0.07749828  0.0470637365 -0.004763317
##               PC7          PC8          PC9          PC10          PC11
## 1952  0.008782268 -0.033852799  0.010181361 -0.0026470059 -0.0055072634
## 1957  0.003604402  0.032315919 -0.001056862 -0.0001880523  0.0153835923
## 1962 -0.021628834  0.022098960 -0.012610536  0.0088033326 -0.0154063239
## 1967 -0.014810550 -0.027653594 -0.011183383 -0.0102816785  0.0058569204
## 1972  0.038497121  0.008761483  0.027521005  0.0093382743  0.0002343443
## 1977 -0.011023208  0.006878510 -0.017291106 -0.0176042027 -0.0053997734
## 1982 -0.015626522 -0.013444835  0.002693047  0.0304410341  0.0105202212
## 1987  0.011823888  0.003593338  0.017406019 -0.0244498912 -0.0043296402
## 1992  0.023383443 -0.002823566 -0.032767219  0.0070424296 -0.0087635373
## 1997 -0.034081572  0.003160905  0.012276381 -0.0062930720  0.0205680231
## 2002 -0.017857197  0.002332744  0.019128104  0.0058902271 -0.0250849119
## 2007  0.029251169 -0.001689084 -0.014071677 -0.0000235639  0.0120149131
##              PC12
## 1952 -0.006153628
## 1957  0.020268310
## 1962 -0.022063082
## 1967  0.011061548
## 1972 -0.004719186
## 1977  0.001789999
## 1982  0.002280300
## 1987 -0.004208394
## 1992  0.004074155
## 1997 -0.011301594
## 2002  0.016766348
## 2007 -0.007747656
print("Life Expectancy rotation:")
## [1] "Life Expectancy rotation:"
lifeExpectancy.pca$rotation[,1:3]
##            PC1         PC2         PC3
## 1952 0.2837703  0.34441629 -0.32952972
## 1957 0.2880971  0.31933803 -0.24260282
## 1962 0.2898354  0.29660608 -0.18199203
## 1967 0.2937536  0.24074227 -0.05922854
## 1972 0.2956429  0.17850314  0.10463448
## 1977 0.2949868  0.11324248  0.26337143
## 1982 0.2969332  0.02533604  0.33250651
## 1987 0.2954179 -0.07504642  0.39549540
## 1992 0.2877706 -0.24266370  0.40007959
## 1997 0.2849288 -0.37063340  0.01000402
## 2002 0.2774799 -0.43388258 -0.34334220
## 2007 0.2744526 -0.44497039 -0.41302205
cor(lifeExp,lifeExpectancy.pca$x)
##            PC1         PC2          PC3          PC4           PC5          PC6
## 1952 0.9445114  0.28308016 -0.129073360 -0.081516551 -3.091603e-02  0.055308506
## 1957 0.9589126  0.26246802 -0.095024996 -0.038755272 -3.450243e-05  0.002216994
## 1962 0.9646986  0.24378433 -0.071284382 -0.019571044  1.245202e-02 -0.038799445
## 1967 0.9777400  0.19786915 -0.023199203  0.026140632  1.949406e-02 -0.050944635
## 1972 0.9840284  0.14671402  0.040984237  0.076103503  3.661604e-02 -0.017408759
## 1977 0.9818447  0.09307544  0.103159845  0.102333414  3.846571e-02  0.065726354
## 1982 0.9883232  0.02082401  0.130239335  0.030117317 -6.088414e-02 -0.005318526
## 1987 0.9832796 -0.06168162  0.154911429 -0.004043822 -6.581306e-02 -0.011403151
## 1992 0.9578261 -0.19944841  0.156707006 -0.121255827  4.005493e-02 -0.003383970
## 1997 0.9483673 -0.30462833  0.003918472 -0.057458491  3.239171e-02  0.006503466
## 2002 0.9235740 -0.35661364 -0.134483564  0.030642948  2.933252e-03  0.004283890
## 2007 0.9134979 -0.36572687 -0.161776436  0.053264933 -2.468057e-02 -0.005409316
##                PC7           PC8           PC9          PC10          PC11
## 1952 -0.0059278336  0.0110563872  0.0066469564 -0.0154678213  0.0007690448
## 1957  0.0018845639  0.0121402617 -0.0098842381  0.0280745636 -0.0008372452
## 1962  0.0064981827 -0.0521319552  0.0014017756 -0.0055185955  0.0011437564
## 1967  0.0008316197  0.0214847575  0.0021788444 -0.0015375341 -0.0030969670
## 1972 -0.0074500924  0.0259279203  0.0068926167 -0.0107843609  0.0029004930
## 1977 -0.0033764876 -0.0182190063  0.0009129623  0.0057714146 -0.0022045050
## 1982  0.0140678450  0.0011489704 -0.0304045534 -0.0074229448  0.0001554075
## 1987  0.0041032175 -0.0025539314  0.0279861569  0.0085457714  0.0012303556
## 1992 -0.0430490938 -0.0006630837 -0.0060830910 -0.0003850919  0.0046768283
## 1997  0.0565006235  0.0045132268  0.0020942458 -0.0027733617 -0.0128406307
## 2002  0.0187732112  0.0011991897 -0.0006707001  0.0017114280  0.0205868393
## 2007 -0.0440835909 -0.0041617339 -0.0010622661 -0.0002048975 -0.0124437763
##               PC12
## 1952 -0.0016576278
## 1957  0.0047891612
## 1962  0.0026764063
## 1967 -0.0140152253
## 1972  0.0121587841
## 1977 -0.0046889737
## 1982  0.0005064825
## 1987  0.0004149916
## 1992 -0.0009218386
## 1997  0.0017238358
## 2002 -0.0022175237
## 2007  0.0012275826

Scatter plots and analysis of combinations of the first three principal component scores

gdp_pca = data.frame(PC1=gdp.pca$x[,1], PC2=gdp.pca$x[,2], PC3=gdp.pca$x[,3], continent=gap[,1], country=gap[,2])
lifeExpectancy_pca = data.frame(PC1=lifeExpectancy.pca$x[,1], PC2=lifeExpectancy.pca$x[,2], PC3=lifeExpectancy.pca$x[,3], continent=gap[,1], country=gap[,2])

ggplot(gdp_pca, aes(x=PC1, y=PC2, color = continent, label=country)) + geom_point() + geom_text(check_overlap = TRUE) + ggtitle("GDP, PC1 and PC2")

ggplot(gdp_pca, aes(x=PC2, y=PC3, color = continent, label=country)) + geom_point() + geom_text(check_overlap = TRUE) + ggtitle("GDP, PC2 and PC3")

Kuwait and Congo Dem. are particularly interesting as Kuwait has PC3 and PC2 high, but Congo Dem has high PC2 and low PC3. In my previous interpretations of the PCs I said that if PC2 is high also would be GDP between 1952 and 1977, but low between 1982 and 2007, and indeed it is what I see in the next plot for Kuwait, although a high PC3 would mean a high value between 1997 and 2007 this is opposite to PC2, and as PC2 has a stronger correlation it justifies a positive trend between 1997 and 2007. Another important thing is that as PC1 is higher for Congo Rep. Dem. than for Kuwait, the opposite happens for GDP. The total GDP is 131.4858 and 76.51869 for Kuwait and Congo Rep. Dem. which is consistent.

gdp = gap[,3:14] # gdp
lifeExp = gap[,15:26] # life exp
rownames(gdp) = gap[,2]
rownames(lifeExp) = gap[,2]
#par(mfrow=c(4,3))
plot(as.numeric(gdp["Kuwait",]), x=years, ylab="GDP", main=paste("GDP","Kuwait"), type="l")

plot(as.numeric(gdp["Congo Dem. Rep.",]), x=years, ylab="GDP", main=paste("GDP","Congo Dem. Rep."), type="l")

sum(gdp["Kuwait",])
## [1] 131.4858
sum(gdp["Congo Dem. Rep.",])
## [1] 76.51869
ggplot(lifeExpectancy_pca, aes(x=PC1, y=PC2, color = continent, label=country)) + geom_point() + geom_text(check_overlap = TRUE) + ggtitle("Life expectancy, PC1 and PC2")

ggplot(lifeExpectancy_pca, aes(x=PC2, y=PC3, color = continent, label=country)) + geom_point() + geom_text(check_overlap = TRUE) + ggtitle("Life expectancy, PC2 and PC3")

Analyzing Life Expectancy for Zimbabwe and Oman, I see that Oman has a slightly higher PC1 and then also Life Expectancy than Zimbabwe. As Zimbabwe has a high PC2 it means a high value in its figure the first half of the graph (1952 to 1982) and a lower value for the second half(1987 to 2007), however, Oman had the opposite value for this figure for both periods. Both countries have a high value for PC3, and this value affects negatively at the beginning and at the end of the series (between 1952 and 1967 and between 2002 to 2007), but positively between 1972 and 1997. This is coincident with the series of the next plots.

gdp = gap[,3:14] # gdp
lifeExp = gap[,15:26] # life exp
rownames(gdp) = gap[,2]
rownames(lifeExp) = gap[,2]
plot(as.numeric(lifeExp["Zimbabwe",]), x=years, ylab="Life exp.", main=paste("Life exp.","Zimbabwe"), type="l")

plot(as.numeric(lifeExp["Oman",]), x=years, ylab="Life exp.", main=paste("Life exp.","Oman"), type="l")

sum(lifeExp["Zimbabwe",])
## [1] 631.958
sum(lifeExp["Oman",])
## [1] 701.312

Classical Multidimensional scaling

lifeExpGdp = gap[,3:26]
rownames(lifeExpGdp) <- gap[,2]
lifeExpGdp.dist=dist(lifeExpGdp, upper = TRUE)
lifeExpGdp.cmd=cmdscale(lifeExpGdp.dist, eig = FALSE, k = 2)
lifeExpGdp_cmd=data.frame(cmd1=lifeExpGdp.cmd[,1], cmd2=lifeExpGdp.cmd[,2], continent=gap[,1], country=gap[,2])
ggplot(lifeExpGdp_cmd, aes(x=cmd1, y=cmd2, color = continent, label=country)) + geom_point() + geom_text(check_overlap = TRUE) + ggtitle("CMD, Life Expectancy and GDP")

Comparison between Classical Multidimensional Analysis and Principal Components Analysis

Doing a Comparison of the plots obtained with CMD and PCA, I see that CMD has a huge advantage in terms of compactness, as we are reducing all the data in one graph with just two dimensions, and also the level of separation of the clusters is similar. However, I see that PCA is superior is in terms of interpretability and still having compact results.

Hotelling's two sample T2-test Hypothesis testing

In Hotelling's two sample T2-test, the null hypothesis (H0) states that there is no difference between the means of two independent samples. The alternative hypothesis (H1) states that there is a significant difference between the means of the two samples.

The p-value in a statistical test represents the probability of observing a test statistic as extreme as the one calculated, given that the null hypothesis is true. A p-value less than the alpha level (typically 0.05) indicates that there is enough evidence to reject the null hypothesis.

Therefore, in Hotelling's two sample T2-test, if the calculated p-value is less than 0.05, we would reject the null hypothesis and conclude that there is a significant difference between the means of the two samples.

X1=gap%>%filter(continent=="Asia")%>%select(gdpPercap_2007,lifeExp_2007)
X2=gap%>%filter(continent=="Europe")%>%select(gdpPercap_2007,lifeExp_2007)
HotellingsT2(X1, X2)
## 
##  Hotelling's two sample T2-test
## 
## data:  X1 and X2
## T.2 = 12.681, df1 = 2, df2 = 60, p-value = 2.55e-05
## alternative hypothesis: true location difference is not equal to c(0,0)
print(paste("Critical value for α=0.05 is",qf(0.95,2,60)))
## [1] "Critical value for α=0.05 is 3.15041131058273"

\(H_0: μ_1=μ_2\) vs \(H_1: μ_2≠μ_2\) The hypothesis \(H_0\) is that GDP and life expectancy in 1952 were equal for Asia and Europe.In this case, the p-value is very low then there is huge evidence of a difference between Asia and Europe in terms of GDP and life expectancy in the year 1952. The critical value lower than the T.2=39 supports this hypothesis as well.

X1=gap%>%filter(continent=="Asia")%>%select(gdpPercap_1952,lifeExp_1952)
X2=gap%>%filter(continent=="Europe")%>%select(gdpPercap_1952,lifeExp_1952)
HotellingsT2(X1, X2)
## 
##  Hotelling's two sample T2-test
## 
## data:  X1 and X2
## T.2 = 39.347, df1 = 2, df2 = 60, p-value = 1.21e-11
## alternative hypothesis: true location difference is not equal to c(0,0)
print(paste("Critical value for α=0.05 is",qf(0.95,2,60)))
## [1] "Critical value for α=0.05 is 3.15041131058273"

For the year 1952, the p-value is even lower, then I could say that they were more different then.

Linear discriminant analysis

For the linear discriminant analysis First I use 30% of the data for the test set (42 examples approximately) and the rest for the training set.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
set.seed(123) # so that I get the same results each time.
test.index <- sample(1:142, size=42)

gap.test <- dplyr::select(gap, -country)[test.index,]
gap.train <- dplyr::select(gap, -country)[-test.index,]

gap.lda<-lda(continent ~ ., gap.train)
gap.pred <- predict(gap.lda, gap.test)
print(paste("The predictive accuracy is ", 
sum(gap.pred$class == gap.test$continent)/dim(gap.test)[1]*100, "%"))
## [1] "The predictive accuracy is  57.1428571428571 %"

The quite low predictive accuracy obtained in this case may be explained by the highly unbalanced data, where, as we see in the next R output, Oceania represents just 0.9% and Africa 36.8% of the data.

gap.lda
## Call:
## lda(continent ~ ., data = gap.train)
## 
## Prior probabilities of groups:
##   Africa Americas     Asia   Europe  Oceania 
##     0.38     0.20     0.21     0.20     0.01 
## 
## Group means:
##          gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967
## Africa         6.811431       6.896881       6.996266       7.134151
## Americas       8.061785       8.170572       8.255846       8.385747
## Asia           7.486033       7.629678       7.732015       7.833065
## Europe         8.440622       8.669796       8.862741       9.078023
## Oceania        9.214292       9.301063       9.410602       9.583704
##          gdpPercap_1972 gdpPercap_1977 gdpPercap_1982 gdpPercap_1987
## Africa         7.255926       7.297963       7.306340       7.251569
## Americas       8.540909       8.675896       8.683815       8.683559
## Asia           8.035114       8.167184       8.207409       8.201288
## Europe         9.305521       9.444677       9.527915       9.606729
## Oceania        9.728457       9.816523       9.876990       9.993734
##          gdpPercap_1992 gdpPercap_1997 gdpPercap_2002 gdpPercap_2007
## Africa         7.237518       7.278828       7.339940       7.460726
## Americas       8.721062       8.796606       8.818189       8.950151
## Asia           8.285280       8.423720       8.507930       8.704134
## Europe         9.515321       9.654124       9.804083       9.978058
## Oceania       10.061549      10.203516      10.331619      10.446839
##          lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967 lifeExp_1972
## Africa       39.11326     41.19161     43.28484     45.35403     47.53913
## Americas     52.05085     54.85880     57.36660     59.55180     61.60320
## Asia         46.48762     49.48419     51.54983     54.85820     57.54352
## Europe       64.54480     66.76775     68.78605     70.01170     71.05750
## Oceania      69.12000     70.33000     70.93000     71.10000     71.93000
##          lifeExp_1977 lifeExp_1982 lifeExp_1987 lifeExp_1992 lifeExp_1997
## Africa       49.78050     51.74676     53.35676     53.65358     53.13255
## Americas     63.62155     65.55095     67.58430     69.19275     70.86310
## Asia         60.00283     62.51476     64.66881     66.21510     67.80638
## Europe       72.19850     73.12010     73.93275     74.57110     75.67085
## Oceania      73.49000     74.74000     76.32000     77.56000     78.83000
##          lifeExp_2002 lifeExp_2007
## Africa       52.33692     53.82224
## Americas     72.24320     73.44930
## Asia         68.99995     70.54000
## Europe       76.93105     77.88905
## Oceania      80.37000     81.23500
## 
## Coefficients of linear discriminants:
##                        LD1         LD2         LD3         LD4
## gdpPercap_1952 -0.36150213 -1.44996671  6.01673022  1.93345865
## gdpPercap_1957 -0.30188638  2.83501245 -2.00963775 -4.18458786
## gdpPercap_1962  2.91908061 -5.12402027 -5.65144860  4.34535216
## gdpPercap_1967 -1.85018480  1.46782661  5.56565210 -1.54204562
## gdpPercap_1972 -1.94550293  2.72667850 -4.82904106  1.74041776
## gdpPercap_1977  2.20684661 -1.32238060  2.13900549 -3.72196862
## gdpPercap_1982 -2.10106879  1.07987750 -3.19438747  3.21563908
## gdpPercap_1987  2.86928446  2.32255546  1.41979461 -2.12055524
## gdpPercap_1992 -1.68799301 -2.76358361  2.77325600 -0.84987614
## gdpPercap_1997 -0.04424094 -1.68224695 -2.04000218  0.66252402
## gdpPercap_2002  0.21653260  1.89425339  4.90532571 -2.61462903
## gdpPercap_2007  0.28121048 -0.11918190 -4.72039347  2.95403657
## lifeExp_1952   -0.08756566  0.76457722 -0.64102214 -0.13227001
## lifeExp_1957    0.07809968 -0.88840067  0.82378694  1.22463806
## lifeExp_1962    0.15781643  0.04430716  0.49443440 -0.16046328
## lifeExp_1967   -0.14964503  0.39955938 -1.14871006 -1.93747242
## lifeExp_1972    0.10912394  0.01809590  0.08758833  0.91871801
## lifeExp_1977    0.15861272 -0.36937492  1.08525003  0.32980109
## lifeExp_1982   -0.16325452  0.40201905 -0.92099797 -0.89695407
## lifeExp_1987   -0.09602775 -0.47149532  0.27608810  0.91267463
## lifeExp_1992    0.04242214  0.09950723 -0.03822550 -0.13169316
## lifeExp_1997   -0.03104650  0.08066902 -0.09532618 -0.09088631
## lifeExp_2002    0.10060555  0.03714213  0.40587834 -0.04735895
## lifeExp_2007    0.03443748 -0.14497433 -0.33056848  0.08294817
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4 
## 0.7426 0.1739 0.0677 0.0157

The proportion of the trace shows that LD1 and LD2 are doing most of the separation and in the next table we see that we misclassify Africa as Asia two times, Americas as Africa 1, Americas as Asia 3 and America as Europe 1, and so on.

table(gap.pred$class, gap.test$continent)
##           
##            Africa Americas Asia Europe Oceania
##   Africa       11        0    2      0       0
##   Americas      2        2    6      2       0
##   Asia          1        1    3      0       0
##   Europe        0        2    1      8       1
##   Oceania       0        0    0      0       0

Plot of the 2d projection of the data onto the first two eigenvectors found by Fisher’s discriminant analysis approach.

Discussing the difference between this plot and the plot I found using PCA. For this graph, in order to do a better visual comparison with the previous graph, I do LDA with all the data instead of splitting it into training and test set. As the objective of PCA is to maximize the variance of the projected data on the PCA graph data looks more disperse, by contrast, as LDA maximizes the variance between groups minimizing variance within, the groups look more compact.

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
gap.lda.full = lda(continent ~ ., gap[,-c(2)])
gap.pred.full <- predict(gap.lda.full, gap[,-c(2)])
lda_full=data.frame(LD1=gap.pred.full$x[,1], LD2=gap.pred.full$x[,2], continent=gap[,1], country=gap[,2])
lda=ggplot(lda_full, aes(x=LD1, y=LD2, color = continent, label=country)) + geom_point() + geom_text(check_overlap = TRUE) + ggtitle("LDA, Life Expectancy and GDP")
pca=ggplot(gdp_pca, aes(x=PC1, y=PC2, color = continent, label=country)) + geom_point() + geom_text(check_overlap = TRUE) + ggtitle("GDP, PC1 and PC2")
grid.arrange(lda,pca)

Clustering

k-means clustering

For this problem, as I know that there are 5 continents, 5 clusters are what I expect. Also looking at the plot made with the function fviz_nbclust of R with the method wss(total within sum of square) which is a heuristic method, the use of 5 clusters makes sense.

set.seed(123)
gap2 <- gap[,3:26]
gap.k <- kmeans(gap2, centers = 5, nstart=25)
table(gap.k$cluster, gap$continent)
##    
##     Africa Americas Asia Europe Oceania
##   1      2       11    8     10       0
##   2      6        8   14      1       0
##   3     23        2    6      0       0
##   4      0        4    4     19       2
##   5     21        0    1      0       0
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(gap2, kmeans, method = "wss")

fviz_cluster(gap.k, gap2, ellipse.type = "norm")

Aglomerative hierarchical clustering

gap.scaled <- gap
gap.scaled[,3:26] <- scale(gap[,3:26])

In this section, hierarchical clustering single, complete, ward, and average are applied to the data. On the next R outputs, we see that single clustering performs very bad as it is classifying most of the data in just one cluster. Complete performs slightly better, but still from the first to the fourth clusters contain elements from Africa, America, and Asia. Ward and average also aren’t performing better than k-means either, concluding that countries do not naturally cluster by continent.

gap.single <- hclust(dist(gap.scaled[,3:26],method="euclidean"),method="single")
plot(gap.single, labels=gap$continent,cex=0.7)

table(cutree(gap.single, k=5), gap$continent)
##    
##     Africa Americas Asia Europe Oceania
##   1     49       25   32     30       2
##   2      1        0    0      0       0
##   3      1        0    0      0       0
##   4      1        0    0      0       0
##   5      0        0    1      0       0
gap.complete <- hclust(dist(gap.scaled[,3:26],method="euclidean"),method="complete")
plot(gap.complete, labels=gap$continent,cex=0.7)

table(cutree(gap.complete, k=5), gap$continent)
##    
##     Africa Americas Asia Europe Oceania
##   1      9       11    9      3       0
##   2     31        1    5      0       0
##   3      8        0    7      0       0
##   4      4        8    6      4       0
##   5      0        5    6     23       2
gap.ward <- hclust(dist(gap.scaled[,3:26],method="euclidean"),method="ward.D2")
plot(gap.ward, labels=gap$continent,cex=0.7)

table(cutree(gap.ward, k=5), gap$continent)
##    
##     Africa Americas Asia Europe Oceania
##   1     11        9   10      1       0
##   2     23        1    7      0       0
##   3     16        0    5      0       0
##   4      2        6    4      2       0
##   5      0        9    7     27       2
gap.average <- hclust(dist(gap.scaled[,3:26],method="euclidean"),method="average")
plot(gap.average, labels=gap$continent,cex=0.7)

table(cutree(gap.average, k=5), gap$continent)
##    
##     Africa Americas Asia Europe Oceania
##   1     13        7   16      1       0
##   2      1        0    0      0       0
##   3     36        1    7      0       0
##   4      2       17    9     29       2
##   5      0        0    1      0       0

Linear regression

I compared Ordinary Least Squares, Principal Components and Ridge Regression. I trained these three models with the training set and then tested them with the test set obtaining the RMSEP, the model with the lowest RMSEP is then trained with the full dataset to predict the 2007 life expectancy from the GDP values.

Ordinary Least Squares

gap4Regression.training=gap[-test.index,c(26,3:14)]
gap4Regression.test=gap[test.index,c(26,3:14)]
gap4Regression.full=gap[,c(26,3:14)]
ols=lm(lifeExp_2007~.,data=gap4Regression.training)
ols_predict = predict(ols, gap4Regression.test)
mean((ols_predict-gap4Regression.test$lifeExp_2007)^2)
## [1] 36.09367

Principal Components Regression Using the PCR function from the pls package R calculates this form me.

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
pcr_fit <- pcr(lifeExp_2007~., data=gap4Regression.training, validation='CV')
summary(pcr_fit)
## Data:    X dimension: 100 12 
##  Y dimension: 100 1
## Fit method: svdpc
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           12.11    8.292    7.882    8.018    8.352    8.340    8.396
## adjCV        12.11    8.280    7.867    7.993    8.302    8.286    8.337
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       8.523    8.510    8.727     8.706     8.607     8.826
## adjCV    8.456    8.429    8.648     8.630     8.519     8.724
## 
## TRAINING: % variance explained
##               1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X               94.76    98.70     99.4    99.65    99.78    99.87    99.90
## lifeExp_2007    54.86    59.71     60.4    61.54    62.18    62.40    62.86
##               8 comps  9 comps  10 comps  11 comps  12 comps
## X               99.93    99.96     99.97     99.99    100.00
## lifeExp_2007    63.26    63.29     64.08     65.86     65.86
plot(RMSEP(pcr_fit), legendpos = "topright")

As the lowest cross-validation error is with two components are minimizing the RMSEP, I am using 2 of them.

pcr_predict = predict(pcr_fit, gap4Regression.test[2:13], ncomp=2)
mean((pcr_predict-gap4Regression.test$lifeExp_2007)^2)
## [1] 29.54078

##Ridge Regression.

library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-1
lambdas <- 10^seq(3,-2,by=-0.1)
cv_fit <- cv.glmnet(as.matrix(gap4Regression.training[,2:13]), gap4Regression.training$lifeExp_2007, alpha = 0, lambda = lambdas)
plot(cv_fit)

cv_fit$lambda.min
## [1] 1.995262
cv_fit$lambda.1se
## [1] 50.11872
ridge_mod = glmnet(as.matrix(gap4Regression.training[,2:13]), gap4Regression.training$lifeExp_2007, alpha=0, lambda=cv_fit$lambda.1se)
ridge_pred = predict(ridge_mod, newx = as.matrix(gap4Regression.test[,2:13]))
mean((ridge_pred - gap4Regression.test$lifeExp_2007)^2)
## [1] 56.06069

Finally, an RMSEP of 29.54078 for PCA regression is the lowest value obtained.

Model RMSEP
OLS 36.09367
PCR 29.54078
Ridge 52.55654

Then I proceed to train the PCA regression model with the full data-set and obtain the predictions:

gap4Regression=gap[,c(26,3:14)]
pcr_fit <- pcr(lifeExp_2007~., data=gap4Regression.training, validation='CV')
pcr_predict = predict(pcr_fit, gap4Regression[2:13], ncomp=2)
rownames(pcr_predict)=gap[,2]
pcr_predict
## , , 2 comps
## 
##                          lifeExp_2007
## Algeria                      67.96314
## Angola                       62.00690
## Benin                        57.44125
## Botswana                     74.23988
## Burkina Faso                 56.01175
## Burundi                      50.96851
## Cameroon                     60.56682
## Central African Republic     52.57879
## Chad                         56.02340
## Comoros                      55.75138
## Congo Dem. Rep.              46.45654
## Congo Rep.                   65.06096
## Cote d'Ivoire                59.26768
## Djibouti                     59.98675
## Egypt                        67.25404
## Equatorial Guinea            67.39622
## Eritrea                      53.64180
## Ethiopia                     51.69248
## Gabon                        74.64548
## Gambia                       53.06808
## Ghana                        55.53803
## Guinea                       55.06320
## Guinea-Bissau                53.33731
## Kenya                        58.00297
## Lesotho                      58.31557
## Liberia                      50.53649
## Libya                        73.14370
## Madagascar                   54.61341
## Malawi                       53.32421
## Mali                         55.00017
## Mauritania                   59.11304
## Mauritius                    71.09131
## Morocco                      64.21723
## Mozambique                   51.22659
## Namibia                      65.53491
## Niger                        51.35890
## Nigeria                      59.51248
## Reunion                      69.05445
## Rwanda                       53.98705
## Sao Tome and Principe        58.59562
## Senegal                      57.89567
## Sierra Leone                 53.83512
## Somalia                      54.40399
## South Africa                 70.06419
## Sudan                        59.70138
## Swaziland                    66.55392
## Tanzania                     54.85076
## Togo                         55.12275
## Tunisia                      68.52896
## Uganda                       53.97149
## Zambia                       56.07777
## Zimbabwe                     52.76645
## Argentina                    71.77855
## Bolivia                      63.99159
## Brazil                       71.38798
## Canada                       80.00019
## Chile                        71.69024
## Colombia                     68.82018
## Costa Rica                   70.15309
## Cuba                         68.52489
## Dominican Republic           66.40230
## Ecuador                      69.20480
## El Salvador                  66.84224
## Guatemala                    66.90404
## Haiti                        57.06665
## Honduras                     63.76635
## Jamaica                      69.34965
## Mexico                       72.69034
## Nicaragua                    61.11579
## Panama                       70.65893
## Paraguay                     65.99960
## Peru                         68.03786
## Puerto Rico                  76.66813
## Trinidad and Tobago          73.03540
## United States                81.08037
## Uruguay                      70.54221
## Venezuela                    71.35231
## Afghanistan                  53.34948
## Bahrain                      77.58911
## Bangladesh                   55.94368
## Cambodia                     55.83928
## China                        64.20445
## Hong Kong China              81.40710
## India                        59.69407
## Indonesia                    64.06723
## Iran                         71.64267
## Iraq                         65.77304
## Israel                       78.20614
## Japan                        80.91343
## Jordan                       65.85100
## Korea Dem. Rep.              61.39613
## Korea Rep.                   77.95538
## Kuwait                       78.07161
## Lebanon                      70.45787
## Malaysia                     72.83259
## Mongolia                     62.21165
## Myanmar                      51.59343
## Nepal                        55.76550
## Oman                         79.10824
## Pakistan                     61.89211
## Philippines                  62.77680
## Saudi Arabia                 77.95701
## Singapore                    82.80819
## Sri Lanka                    63.55577
## Syria                        65.58129
## Taiwan                       79.62423
## Thailand                     69.57438
## Vietnam                      58.91653
## West Bank and Gaza           67.44063
## Yemen Rep.                   62.05297
## Albania                      65.86756
## Austria                      80.53609
## Belgium                      79.78290
## Bosnia and Herzegovina       68.46156
## Bulgaria                     70.74783
## Croatia                      73.73951
## Czech Republic               76.00763
## Denmark                      79.97832
## Finland                      79.31436
## France                       79.46689
## Germany                      79.81658
## Greece                       78.36104
## Hungary                      74.40764
## Iceland                      80.35377
## Ireland                      79.79269
## Italy                        79.31342
## Montenegro                   70.76260
## Netherlands                  80.22076
## Norway                       82.48104
## Poland                       72.96436
## Portugal                     77.31539
## Romania                      71.12079
## Serbia                       71.73820
## Slovak Republic              74.14363
## Slovenia                     77.57260
## Spain                        78.81542
## Sweden                       79.39946
## Switzerland                  80.20050
## Turkey                       69.75724
## United Kingdom               79.03436
## Australia                    79.35942
## New Zealand                  77.10131

Accuracy of the model

mean((gap4Regression.full$lifeExp_2007-pcr_predict)^2)
## [1] 49.53311

Conclusion

Our analysis reveals a strong correlation between GDP per capita and life expectancy, with higher GDP generally associated with longer life expectancy. Principal component analysis helps us understand the variability in the data and identify key components that explain the variation in GDP and life expectancy. The interpretation of the leading principal components provides valuable insights into the relationship between the variables for different countries.

Hypothesis testing is conducted to compare GDP and life expectancy between Asia and Europe, providing statistical evidence to reject the null hypothesis that both continents have equal GDP and life expentancy in 2007. Linear discriminant analysis and clustering are employed to analyze the data from a classification perspective, revealing the clustering patterns within the continents.

Finally, linear regression models are trained to predict life expectancy in 2007 based on GDP values, with principal components regression yielding the most accurate results. Overall, this report highlights the importance of economic development in improving life expectancy and provides valuable insights into the relationship between GDP per capita and life expectancy.